{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> [Kaggle入门教程2-改进特征](http://www.cnblogs.com/cutd/p/5713525.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 改进模型\n",
    "## 使用sklearn随机森林拟合非线性成分\n",
    "- 用储存在**alg**的随机森林算法去做交叉验证。用predictions去预测Survived列。将结果赋值到**scores**。\n",
    "- 使用**model_selection.cross_val_score**来完成这些。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "titanic = pd.read_csv('Data/train.csv')\n",
    "# 处理Age中的缺失值\n",
    "titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())\n",
    "# 量化Sex\n",
    "titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0\n",
    "titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1\n",
    "# 量化Embarked\n",
    "titanic['Embarked'] = titanic['Embarked'].fillna('S')\n",
    "titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0\n",
    "titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1\n",
    "titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "predictors = [\"Pclass\", \"Sex\", \"Age\", \"SibSp\", \"Parch\", \"Fare\", \"Embarked\"]\n",
    "\n",
    "# Initialize our algorithm with the default paramters\n",
    "# n_estimators is the number of trees we want to make\n",
    "# min_samples_split is the minimum number of rows we need to make a split\n",
    "# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)\n",
    "alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)\n",
    "# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)\n",
    "scores = cross_val_score(alg, titanic[predictors], titanic[\"Survived\"],cv=3)\n",
    "\n",
    "# Take the mean of the scores (because we have one for each fold)\n",
    "print(scores.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 调参\n",
    "- 增加我们使用的树的数量会很大的提升预测的准确率，训练更多的树会花费更多的时间\n",
    "- 调整**min_samples_split**和**min_samples_leaf**变量来减少过拟合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)\n",
    "# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)\n",
    "scores = cross_val_score(alg, titanic[predictors], titanic[\"Survived\"], cv=3)\n",
    "\n",
    "# Take the mean of the scores (because we have one for each fold)\n",
    "print(scores.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 生成新特征\n",
    "- 一些点子：\n",
    "  - 名字的长度——这和那人有多富有，所以在泰坦尼克上的位置有关。\n",
    "  - 一个家庭的总人数(**SibSp**+**Parch**)。\n",
    "- 使用pandas数据框的**.apply**方法来生成特征。这会对你传入数据框(dataframe)或序列(series)的每一个元素应用一个函数。我们也可以传入一个**lambda**函数使我们能够定义一个匿名函数。\n",
    "- 一个匿名的函数的语法是**lambda x:len(x)**。x将传入的值作为输入值——在本例中，就是乘客的名字。表达式的右边和返回结果将会应用于x。**.apply**方法读取这些所有输出并且用他们构造出一个pandas序列。我们可以将这个序列赋值到一个数据框列。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Generating a familysize column\n",
    "titanic[\"FamilySize\"] = titanic[\"SibSp\"] + titanic[\"Parch\"]\n",
    "\n",
    "# The .apply method generates a new series\n",
    "titanic[\"NameLength\"] = titanic[\"Name\"].apply(lambda x: len(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 使用头衔\n",
    "- 从乘客的名字中提取出他们的头衔。头衔的格式是**Master.,Mr.,Mrs.**，以一个大写字母开头，后面是小写字母，最后以.结尾。有一些非常常见的头衔，也有一些“长尾理论”中的一次性头衔只有仅仅一个或者两个乘客使用。第一步使用**正则表达式**提取头衔，然后将每一个唯一头衔匹配成(map)整型数值。\n",
    "- 之后我们将得到一个准确的和Title相对应的数值列。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# A function to get the title from a name.\n",
    "def get_title(name):\n",
    "    # Use a regular expression to search for a title.  Titles always consist of capital and lowercase letters, and end with a period.\n",
    "    title_search = re.search(' ([A-Za-z]+)\\.', name)\n",
    "    # If the title exists, extract and return it.\n",
    "    if title_search:\n",
    "        return title_search.group(1)\n",
    "    return \"\"\n",
    "\n",
    "# Get all the titles and print how often each one occurs.\n",
    "titles = titanic[\"Name\"].apply(get_title)\n",
    "print(pd.value_counts(titles))\n",
    "\n",
    "# Map each title to an integer.  Some titles are very rare, and are compressed into the same codes as other titles.\n",
    "title_mapping = {\"Mr\": 1, \"Miss\": 2, \"Mrs\": 3, \"Master\": 4, \"Dr\": 5, \"Rev\": 6, \"Major\": 7, \"Col\": 7, \"Mlle\": 8, \"Mme\": 8, \"Don\": 9, \"Lady\": 10, \"Countess\": 10, \"Jonkheer\": 10, \"Sir\": 9, \"Capt\": 7, \"Ms\": 2}\n",
    "for k,v in title_mapping.items():\n",
    "    titles[titles == k] = v\n",
    "\n",
    "# Verify that we converted everything.\n",
    "print(pd.value_counts(titles))\n",
    "\n",
    "# Add in the title column.\n",
    "titanic[\"Title\"] = titles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 家族\n",
    "- 生成一个特征来表示哪些人是一个家族。因为幸存看起来非常依靠你的家族和你旁边的人，这是一个成为好特征的好机会。\n",
    "- 通过FamilySize连接某些人的姓来得到一个家庭编号。然后我们将基于他们的家庭编号给每个人赋值一个代码。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import operator\n",
    "# A dictionary mapping family name to id\n",
    "family_id_mapping = {}\n",
    "# A function to get the id given a row\n",
    "def get_family_id(row):\n",
    "    # Find the last name by splitting on a comma\n",
    "    last_name = row[\"Name\"].split(\",\")[0]\n",
    "    # Create the family id\n",
    "    family_id = \"{0}{1}\".format(last_name, row[\"FamilySize\"])\n",
    "    # Look up the id in the mapping\n",
    "    if family_id not in family_id_mapping:\n",
    "        if len(family_id_mapping) == 0:\n",
    "            current_id = 1\n",
    "        else:\n",
    "            # Get the maximum id from the mapping and add one to it if we don't have an id\n",
    "            current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)\n",
    "        family_id_mapping[family_id] = current_id\n",
    "    return family_id_mapping[family_id]\n",
    "\n",
    "# Get the family ids with the apply method\n",
    "family_ids = titanic.apply(get_family_id, axis=1)\n",
    "\n",
    "# There are a lot of family ids, so we'll compress all of the families under 3 members into one code.\n",
    "family_ids[titanic[\"FamilySize\"] < 3] = -1\n",
    "\n",
    "# Print the count of each unique id.\n",
    "print(pd.value_counts(family_ids))\n",
    "\n",
    "titanic[\"FamilyId\"] = family_ids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 找出最好的特征\n",
    "- 一种方法就是使用单特征选择器(univariate feature selection),这种方法的本质是一列一列的遍历计算出和我们想要预测的结果(Survived)最密切关联的那一列。\n",
    "- sklearn有一个叫做SelectKBest的函数将会帮助我们完成特征选择。这个函数会从数据中选出最好的特征，并且允许我们指定选择的数量。\n",
    "- 我们已经更新predictors。为titanic数据框做用3折交叉验证预测。用predictors预测Survived列。将结果赋值给scores。\n",
    "- 在做完交叉验证预测之后，将我们的scores的平均值打印出来。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.feature_selection import SelectKBest, f_classif\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "predictors = [\"Pclass\", \"Sex\", \"Age\", \"SibSp\", \"Parch\", \"Fare\", \"Embarked\", \"FamilySize\", \"Title\", \"FamilyId\"]\n",
    "\n",
    "# Perform feature selection\n",
    "selector = SelectKBest(f_classif, k=5)\n",
    "selector.fit(titanic[predictors], titanic[\"Survived\"])\n",
    "\n",
    "# Get the raw p-values for each feature, and transform from p-values into scores\n",
    "scores = -np.log10(selector.pvalues_)\n",
    "\n",
    "# Plot the scores.  See how \"Pclass\", \"Sex\", \"Title\", and \"Fare\" are the best?\n",
    "plt.bar(range(len(predictors)), scores)\n",
    "plt.xticks(range(len(predictors)), predictors, rotation='vertical')\n",
    "plt.show()\n",
    "\n",
    "# Pick only the four best features.\n",
    "predictors = [\"Pclass\", \"Sex\", \"Fare\", \"Title\"]\n",
    "\n",
    "alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=8, min_samples_leaf=4)\n",
    "# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)\n",
    "scores = cross_val_score(alg, titanic[predictors], titanic[\"Survived\"], cv=3)\n",
    "\n",
    "# Take the mean of the scores (because we have one for each fold)\n",
    "print(scores.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 梯度提升(Gradient Boosting)\n",
    "- 另外一种方法是以决策树为基础的梯度提升分类器。提升包含了一个接一个的训练决策树，并且将一个树的误差传入到下一棵树。所以每一颗树都是以它之前的所有树为基础构造的。不过如果我们建造太多的树会导致过拟合。当你得到了100棵左右的树，这会非常容易过拟合和训练出数据集中的怪特征。当我们的数据集极小时，我们将会把树的数量限制在25。\n",
    "- 另外一种防止过拟合的方法就是限制在梯度提升过程中建立的每一棵树的深度。我们将树的高度限制到3来分避免过拟合。\n",
    "- 我们将试图用提升来替换掉我们的随机森林方法并且观察是否会提升我们的预测准确率。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 集成(Ensembling)\n",
    "- 为了提升我们的预测准确率，我们能做的一件事就是集成不同的分类器。集成的意思就是我们利用一系列的分类器的信息来生成预测结果而不是仅仅用一个。在实践中，这意味着我们是求他们预测结果的平均值。\n",
    "- 通常来说，我们集成越多的越不同的模型，我们结果的准确率就会越高。多样性的意思是模型从不同列生成结果，或者使用一个非常不同的方法来生成预测结果。集成一个随机森林和一个决策树大概不会得到一个非常好的结果，因为他们非常的相似。换句话说，集成一个线性回归和一个随机森林可以工作得非常棒。\n",
    "- 一个关于集成的警示就是我们使用的分类器的准确率必须都是差不多的。集成一个分类器的准确率比另外一个差得多将会导致最后的结果也变差。\n",
    "- 在这一节中，我们将会集成基于大多数线性预测训练的逻辑回归(有一个线性排序，和Survived有一些关联)和一个在所有预测元素上训练的梯度提升树。\n",
    "- 在我们集成的时候会保持事情的简单——我们将会求我们从分类器中得到的行概率(0~1)的平均值，然后假定所有大于0.5的匹配成1，小于等于0.5的匹配成0。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import GradientBoostingClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import KFold\n",
    "import numpy as np\n",
    "\n",
    "# The algorithms we want to ensemble.\n",
    "# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.\n",
    "algorithms = [\n",
    "    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), [\"Pclass\", \"Sex\", \"Age\", \"Fare\", \"Embarked\", \"FamilySize\", \"Title\", \"FamilyId\"]],\n",
    "    [LogisticRegression(random_state=1), [\"Pclass\", \"Sex\", \"Fare\", \"FamilySize\", \"Title\", \"Age\", \"Embarked\"]]\n",
    "]\n",
    "\n",
    "# Initialize the cross validation folds\n",
    "folder = KFold(n_splits=3, random_state=1)\n",
    "\n",
    "predictions = []\n",
    "for train, test in folder.split(titanic['Age']):\n",
    "    train_target = titanic[\"Survived\"].iloc[train]\n",
    "    full_test_predictions = []\n",
    "    # Make predictions for each algorithm on each fold\n",
    "    for alg, predictors in algorithms:\n",
    "        # Fit the algorithm on the training data.\n",
    "        alg.fit(titanic[predictors].iloc[train,:], train_target)\n",
    "        # Select and predict on the test fold.  \n",
    "        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.\n",
    "        test_predictions = alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]\n",
    "        full_test_predictions.append(test_predictions)\n",
    "    # Use a simple ensembling scheme -- just average the predictions to get the final classification.\n",
    "    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2\n",
    "    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.\n",
    "    test_predictions[test_predictions <= .5] = 0\n",
    "    test_predictions[test_predictions > .5] = 1\n",
    "    predictions.append(test_predictions)\n",
    "\n",
    "# Put all the predictions together into one array.\n",
    "predictions = np.concatenate(predictions, axis=0)\n",
    "\n",
    "# Compute accuracy by comparing to the training data.\n",
    "accuracy = sum(predictions[predictions == titanic[\"Survived\"]]) / len(predictions)\n",
    "print(accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 在测试集上匹配我们的变化\n",
    "- 生成NameLength,表示名字的长度。\n",
    "- 生成FamilySize,表示家庭的大小。\n",
    "- 添加Title列，保持和我们之前做的匹配一样。\n",
    "- 添加FamilyId列，测试数据集合训练数据集的id保持一致。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "titanic_test = pd.read_csv('Data/test.csv')\n",
    "\n",
    "# 处理测试集\n",
    "titanic_test = pd.read_csv(\"Data/test.csv\")\n",
    "titanic_test['Age'] = titanic_test['Age'].fillna(titanic_test['Age'].median())\n",
    "titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median())\n",
    "titanic_test.loc[titanic_test['Sex'] == 'male', 'Sex'] = 0\n",
    "titanic_test.loc[titanic_test['Sex'] == 'female', 'Sex'] = 1\n",
    "titanic_test['Embarked'] = titanic_test['Embarked'].fillna('S')\n",
    "titanic_test.loc[titanic_test['Embarked'] == 'S', 'Embarked'] = 0\n",
    "titanic_test.loc[titanic_test['Embarked'] == 'C', 'Embarked'] = 1\n",
    "titanic_test.loc[titanic_test['Embarked'] == 'Q', 'Embarked'] = 2\n",
    "\n",
    "# First, we'll add titles to the test set.\n",
    "titles = titanic_test[\"Name\"].apply(get_title)\n",
    "# We're adding the Dona title to the mapping, because it's in the test set, but not the training set\n",
    "title_mapping = {\"Mr\": 1, \"Miss\": 2, \"Mrs\": 3, \"Master\": 4, \"Dr\": 5, \"Rev\": 6, \"Major\": 7, \"Col\": 7, \"Mlle\": 8, \"Mme\": 8, \"Don\": 9, \"Lady\": 10, \"Countess\": 10, \"Jonkheer\": 10, \"Sir\": 9, \"Capt\": 7, \"Ms\": 2, \"Dona\": 10}\n",
    "for k,v in title_mapping.items():\n",
    "    titles[titles == k] = v\n",
    "titanic_test[\"Title\"] = titles\n",
    "# Check the counts of each unique title.\n",
    "print(pd.value_counts(titanic_test[\"Title\"]))\n",
    "\n",
    "# Now, we add the family size column.\n",
    "titanic_test[\"FamilySize\"] = titanic_test[\"SibSp\"] + titanic_test[\"Parch\"]\n",
    "\n",
    "# Now we can add family ids.\n",
    "# We'll use the same ids that we did earlier.\n",
    "print(family_id_mapping)\n",
    "\n",
    "family_ids = titanic_test.apply(get_family_id, axis=1)\n",
    "family_ids[titanic_test[\"FamilySize\"] < 3] = -1\n",
    "titanic_test[\"FamilyId\"] = family_ids\n",
    "titanic_test[\"NameLength\"] = titanic_test[\"Name\"].apply(lambda x: len(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 在测试集上做预测\n",
    "- 通过将预测结果小于等于0.5的转换成0,将大于0.5的转换成1，将所有的结果转换成非0即1。\n",
    "- 然后，用.astype(int)方法将预测结果转换成整型——如果你不这样做，Kaggle将会给你0分。\n",
    "- 最后，生成一个第一列是PassengerId，第二列是Survived(这就是预测结果)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "predictors = [\"Pclass\", \"Sex\", \"Age\", \"Fare\", \"Embarked\", \"FamilySize\", \"Title\", \"FamilyId\"]\n",
    "\n",
    "algorithms = [\n",
    "    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],\n",
    "    [LogisticRegression(random_state=1), [\"Pclass\", \"Sex\", \"Fare\", \"FamilySize\", \"Title\", \"Age\", \"Embarked\"]]\n",
    "]\n",
    "\n",
    "full_predictions = []\n",
    "for alg, predictors in algorithms:\n",
    "    # Fit the algorithm using the full training data.\n",
    "    alg.fit(titanic[predictors], titanic[\"Survived\"])\n",
    "    # Predict using the test dataset.  We have to convert all the columns to floats to avoid an error.\n",
    "    predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:,1]\n",
    "    full_predictions.append(predictions)\n",
    "\n",
    "# The gradient boosting classifier generates better predictions, so we weight it higher.\n",
    "predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4\n",
    "predictions[predictions <= .5] = 0\n",
    "predictions[predictions > .5] = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "predictions = predictions.astype(int)\n",
    "submission = pd.DataFrame({\n",
    "        \"PassengerId\": titanic_test[\"PassengerId\"],\n",
    "        \"Survived\": predictions\n",
    "    })\n",
    "submission.to_csv('Data/submission.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 其他改进\n",
    "## 特征方面\n",
    "- 尝试用和船舱相关的特征。\n",
    "- 观察家庭大小特征是否会有帮助——一个家庭中女性的数量多使全家更可能幸存？\n",
    "- 乘客的国籍能为其幸存提高什么帮助？\n",
    "## 算法方面\n",
    "- 尝试在集成中加入随机森林分类器。\n",
    "- 在这个数据上支持向量机也许会很有效。\n",
    "- 可以试试神经网络。\n",
    "- 提升一个不同的基础匪类器也许会更好。\n",
    "## 集成方法\n",
    "- 多数表决是比概率平均更好的集成方法？"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
